Introduction
A basic task in analysing RNA-seq count data is to detect differentially expressed genes (DEGs). A gene is differentially expressed if the change in the read counts between two experimental conditions is statistically significant. It is possible to perform differential expression analysis with the DESeq2 package. Its differential expression tests are based on a negative binomial generalised linear model.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## âś” ggplot2 3.3.6 âś” purrr 0.3.4
## âś” tibble 3.1.8 âś” dplyr 1.0.9
## âś” tidyr 1.2.0 âś” stringr 1.4.0
## âś” readr 2.1.2 âś” forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag() masks stats::lag()
## Loading required package: S4Vectors
##
## Loading required package: stats4
##
## Loading required package: BiocGenerics
##
##
## Attaching package: 'BiocGenerics'
##
##
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
##
##
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
##
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
##
##
##
## Attaching package: 'S4Vectors'
##
##
## The following objects are masked from 'package:dplyr':
##
## first, rename
##
##
## The following object is masked from 'package:tidyr':
##
## expand
##
##
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
##
##
## Loading required package: IRanges
##
##
## Attaching package: 'IRanges'
##
##
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
##
##
## The following object is masked from 'package:purrr':
##
## reduce
##
##
## The following object is masked from 'package:grDevices':
##
## windows
##
##
## Loading required package: GenomicRanges
##
## Loading required package: GenomeInfoDb
##
## Loading required package: SummarizedExperiment
##
## Loading required package: MatrixGenerics
##
## Loading required package: matrixStats
##
##
## Attaching package: 'matrixStats'
##
##
## The following object is masked from 'package:dplyr':
##
## count
##
##
##
## Attaching package: 'MatrixGenerics'
##
##
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
##
##
## Loading required package: Biobase
##
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
##
##
## Attaching package: 'Biobase'
##
##
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
##
##
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
##
##
## Loading required package: ggrepel
DESeq2
Importing Data
The first step is importing data. An RNA-seq count data should be a non-normalised sequence read counts at either the gene or transcript level presented as a table. The rows are the detected genes, and the columns are the number of sequence fragments assigned to each gene for each sample.
count_data <- read_delim(file = "data/count_data.txt") # import count data## Rows: 24341 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): tracking_id, locus
## dbl (8): 3_Leukemic_FPKM, 5_Leukemic_FPKM, 6_Leukemic_FPKM, 7_Leukemic_FPKM,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
count_data <- count_data[, -6]
colnames(count_data)[3:9] <- gsub("_FPKM", "", colnames(count_data)[3:9])
colnames(count_data)[3:9] <- gsub("-", "_", colnames(count_data)[3:9])Make a DESeqDataSet Object
The next step is to create an object of class DESeqDataSet, which
will store the counts and intermediate calculations we need for the
differential expression analysis. We use
DESeqDataSetFromMatrix(couvst_dfata, colData, design) for
this aim. The first argument is the coun_data. The input should be a
matrix, including integer values. Let’s take a look at our count
data.
str(count_data)## tibble [24,341 Ă— 9] (S3: tbl_df/tbl/data.frame)
## $ tracking_id : chr [1:24341] "0610005C13Rik" "0610007P14Rik" "0610009B22Rik" "0610009L18Rik" ...
## $ locus : chr [1:24341] "chr7:45567794-45589710" "chr12:85815454-85824545" "chr11:51685384-51688634" "chr11:120348677-120351190" ...
## $ 3_Leukemic : num [1:24341] 0.01 31.07 14.6 1.07 32.44 ...
## $ 5_Leukemic : num [1:24341] 0.0898 37.7451 14.9858 1.8318 31.9144 ...
## $ 6_Leukemic : num [1:24341] 0.01 30.57 12.16 1.89 23.98 ...
## $ 9_pre_leukemic : num [1:24341] 0.01 31.821 11.763 0.655 30.516 ...
## $ 8_pre_leukemic : num [1:24341] 0.01 34.45 15.24 1.13 29.91 ...
## $ 10_pre_leukemic: num [1:24341] 0.01 33.28 12.11 1.06 32.7 ...
## $ 11_pre_leukemic: num [1:24341] 0.01 27.26 14.64 1.06 33.12 ...
The first two columns are characters. Other columns are numeric. And it is a data frame. So, we need to convert it to a matrix. But we have mentioned in the previous session that the data type of matrix values should be the same. We need to convert them to integer values. First, we remove character columns and then we round the numeric values.
# adding gene names as rownames
count_data <- tibble::column_to_rownames(count_data, var = "tracking_id")
head(count_data)## locus 3_Leukemic 5_Leukemic 6_Leukemic
## 0610005C13Rik chr7:45567794-45589710 0.0100 0.0898211 0.0100
## 0610007P14Rik chr12:85815454-85824545 31.0667 37.7451000 30.5707
## 0610009B22Rik chr11:51685384-51688634 14.6030 14.9858000 12.1555
## 0610009L18Rik chr11:120348677-120351190 1.0670 1.8318000 1.8923
## 0610009O20Rik chr18:38238404-38262629 32.4397 31.9144000 23.9829
## 0610010B08Rik chr2:175185164-175338212 0.0100 0.0100000 0.0100
## 9_pre_leukemic 8_pre_leukemic 10_pre_leukemic 11_pre_leukemic
## 0610005C13Rik 0.010000 0.01000 0.01000 0.01000
## 0610007P14Rik 31.820800 34.45450 33.27710 27.26090
## 0610009B22Rik 11.762600 15.23670 12.10740 14.63560
## 0610009L18Rik 0.655317 1.12876 1.06295 1.06027
## 0610009O20Rik 30.516100 29.90590 32.69850 33.12450
## 0610010B08Rik 0.010000 0.01000 0.01000 0.01000
# removing the character column
count_data <- count_data[, -1]
# convert the count_data to a matrix
count_data_mat <- as.matrix(count_data)
# round the values to get rid of the decimal part
count_data_rounded <- round(count_data_mat, digits = 0)
head(count_data_rounded)## 3_Leukemic 5_Leukemic 6_Leukemic 9_pre_leukemic 8_pre_leukemic
## 0610005C13Rik 0 0 0 0 0
## 0610007P14Rik 31 38 31 32 34
## 0610009B22Rik 15 15 12 12 15
## 0610009L18Rik 1 2 2 1 1
## 0610009O20Rik 32 32 24 31 30
## 0610010B08Rik 0 0 0 0 0
## 10_pre_leukemic 11_pre_leukemic
## 0610005C13Rik 0 0
## 0610007P14Rik 33 27
## 0610009B22Rik 12 15
## 0610009L18Rik 1 1
## 0610009O20Rik 33 33
## 0610010B08Rik 0 0
The second argument is colData. A colData is a data frame containing metadata about each sample. Each column should contain categorical values. It means you need to convert them to factor. The priority of the levels of factor variable is important. Here, pre_leukemic should be the reference. If we had other samples types like normal, the priority should be this sequence: normal, pre-leukemic, leukemic.
# creating colData
colnames(count_data) # check column names## [1] "3_Leukemic" "5_Leukemic" "6_Leukemic" "9_pre_leukemic"
## [5] "8_pre_leukemic" "10_pre_leukemic" "11_pre_leukemic"
# the first column vector
sample_type <- c(rep("leukemic", 3), rep("pre_leukemic", 4))
sample_type <- factor(sample_type, levels = c("pre_leukemic", "leukemic"))
# the second column vector
gender <- c(rep(c("female", "male"), times = 3), "female")
gender <- factor(gender, levels = c("female", "male"))
colData <- data.frame(sample_type = sample_type,
gender = gender,
row.names = colnames(count_data))
colData## sample_type gender
## 3_Leukemic leukemic female
## 5_Leukemic leukemic male
## 6_Leukemic leukemic female
## 9_pre_leukemic pre_leukemic male
## 8_pre_leukemic pre_leukemic female
## 10_pre_leukemic pre_leukemic male
## 11_pre_leukemic pre_leukemic female
The third argument is design. The DESeqDataSet object store the design formula used to estimate dispersion and log2 fold changes used within the model. Specifying the formula should take the form of a “~” followed by “+” separating factors.
dds <- DESeqDataSetFromMatrix(countData = count_data_rounded,
colData = colData,
design = ~ sample_type + gender)## converting counts to integer mode
# View(dds)You can see that dds stores the count data in assays, the design
formula in the design, and the colData in the colData. It is impossible
to access the count data directly from a DESeqDataSet object. We need to
use assay().
head(assay(dds))## 3_Leukemic 5_Leukemic 6_Leukemic 9_pre_leukemic 8_pre_leukemic
## 0610005C13Rik 0 0 0 0 0
## 0610007P14Rik 31 38 31 32 34
## 0610009B22Rik 15 15 12 12 15
## 0610009L18Rik 1 2 2 1 1
## 0610009O20Rik 32 32 24 31 30
## 0610010B08Rik 0 0 0 0 0
## 10_pre_leukemic 11_pre_leukemic
## 0610005C13Rik 0 0
## 0610007P14Rik 33 27
## 0610009B22Rik 12 15
## 0610009L18Rik 1 1
## 0610009O20Rik 33 33
## 0610010B08Rik 0 0
Differential Expression Analysis
Pre-filtering
Genes with very low counts don’t lead to reliable differential expression results. We need to do some light pre-filtering. This will reduce the size of the DEseq2DataSet object and speed up the algorithm’s runtime.
Here we are filtering the genes with a sum total of 10 reads on all samples.
nrow(dds)## [1] 24341
dds <- dds[rowSums(assay(dds)) > 10, ]
nrow(dds)## [1] 10486
The DESeq()
We use DESeq() to perform differential expression
analysis. It performs some steps, including an outlier removal
procedure. This function performs a default analysis through these below
steps:
- estimation of size factors
- estimation of dispersion
- Negative Binomial GLM fitting and Wald statistics
dds <- DESeq(object = dds)## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
dds## class: DESeqDataSet
## dim: 10486 7
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(10486): 0610007P14Rik 0610009B22Rik ... a l7Rn6
## rowData names(26): baseMean baseVar ... deviance maxCooks
## colnames(7): 3_Leukemic 5_Leukemic ... 10_pre_leukemic 11_pre_leukemic
## colData names(3): sample_type gender sizeFactor
Shrinkage the Results
The estimates of differences between sample groups calculated by
DESeq() are not corrected for expression level. When counts
are small, it leads to large differences between sample groups. We can
fix this by applying a “shrinkage” procedure.
For this aim, we use lfcShrink(), but first we need to
know the name and/or the position of the “coefficient”^1
that is calculated by DESeq(). We can use
resultsNames(). The resultsNames() returns the
names of the model’s estimated effects (coefficients).
resultsNames(dds)## [1] "Intercept"
## [2] "sample_type_leukemic_vs_pre_leukemic"
## [3] "gender_male_vs_female"
We are interested in comparing pre-leukemic and leukemic samples. It is the second element.
shrunken_res <- lfcShrink(dds = dds,
# the coefficient we want to reestimate
coef = 2,
# We will use `ashr` for estimation
type = "ashr")## using 'ashr' for LFC shrinkage. If used in published research, please cite:
## Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
## https://doi.org/10.1093/biostatistics/kxw041
class(shrunken_res)## [1] "DESeqResults"
## attr(,"package")
## [1] "DESeq2"
head(shrunken_res)## log2 fold change (MMSE): sample type leukemic vs pre leukemic
## Wald test p-value: sample type leukemic vs pre leukemic
## DataFrame with 6 rows and 5 columns
## baseMean log2FoldChange lfcSE pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 0610007P14Rik 32.15434 0.01240045 0.117411 0.7285111 0.999930
## 0610009B22Rik 13.65484 0.00355700 0.135085 0.9127776 0.999930
## 0610009O20Rik 30.55839 -0.00709896 0.117273 0.8406072 0.999930
## 0610010F05Rik 7.25915 0.01115646 0.151015 0.7077363 0.999930
## 0610010K14Rik 61.84799 0.01081064 0.105673 0.7589358 0.999930
## 0610011F06Rik 22.80726 0.12788310 0.243674 0.0326866 0.488071
Here we describe what each column means:
baseMean: The average of the normalised count values, divided by size factors, taken over all samples.
log2FoldChange: This value indicates how much the gene or transcript’s expression seems to have changed between the comparison and control groups. This value is reported on a logarithmic scale to base 2.
lfcSE: Gives the standard error of the log2FoldChange.
pvalue: P-value of the test for the gene or transcript.
padj: Adjusted P-value for multiple testing for the gene or transcript.
The summary() helps us to see the results briefly. You
can set the alpha argument to 0.05. This is the value of
the adjusted p-value.
summary(shrunken_res, alpha = 0.05)##
## out of 10486 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 194, 1.9%
## LFC < 0 (down) : 40, 0.38%
## outliers [1] : 0, 0%
## low counts [2] : 407, 3.9%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Missing P-values
Let’s check if there is any missing adjusted p-value.
shrunken_res[is.na(shrunken_res$padj), ]## log2 fold change (MMSE): sample type leukemic vs pre leukemic
## Wald test p-value: sample type leukemic vs pre leukemic
## DataFrame with 407 rows and 5 columns
## baseMean log2FoldChange lfcSE pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 0610040B10Rik 1.70388 -0.01150762 0.206223 0.662990 NA
## 1110006O24Rik 1.56898 0.01088935 0.207915 0.679145 NA
## 1110020A21Rik 1.56561 0.01369499 0.212707 0.608400 NA
## 1500009L16Rik 1.70209 -0.03435321 0.281815 0.277075 NA
## 1500015A07Rik 2.13119 -0.00206597 0.184789 0.936103 NA
## ... ... ... ... ... ...
## Zfp97 1.56632 0.01361175 0.210612 0.610362 NA
## Zkscan16 1.84455 -0.02299480 0.238024 0.421149 NA
## Zkscan7 1.84410 -0.02371225 0.240433 0.409602 NA
## Zscan12 1.56424 0.00325981 0.199363 0.898953 NA
## Zxdb 1.84835 -0.00396092 0.191899 0.877713 NA
Why do we have genes with no adjusted p-value?
Adjusted p-values can be set to NA values if all samples have zero counts within a row, if a row contains a sample with an extreme count outlier, or if a row is filtered because it has a low mean normalised count.
Significant Genes
The genes with adjusted p-value < 0.05 are statistically significant. We are going to select them.
shrunk_res_sig <- subset(shrunken_res, shrunken_res$padj <= 0.05)Exploring the Results
Volcano Plot
p <- EnhancedVolcano(shrunken_res,
lab = NA,
x = 'log2FoldChange', y = 'padj',
FCcutoff = 1, pCutoff = 0.05,
title = 'Leukemic vs Pre-leukemic',
legendPosition = "top")
ggsave(filename = "plots/not_labeled_volcano.jpg",
plot = p,
dpi = 600,
units = "in",
height = 8,
width = 8)We can name the genes and add a theme layer to the plot.
p <- EnhancedVolcano(shrunken_res,
lab = rownames(shrunken_res),
labSize = 3,
x = 'log2FoldChange', y = 'padj',
FCcutoff = 1, pCutoff = 0.05,
title = 'Leukemic vs Pre-leukemic') +
theme_get()
ggsave(filename = "plots/labeled_volcano.jpg",
plot = p,
dpi = 600,
units = "in",
height = 5,
width = 8)Heatmap
Before visualising count data, we need to transform it. The variance stabilising transformation or VST is the most accurate and fastest way.
vst_obj <- vst(dds, blind = FALSE)
vst_df <- as.data.frame(assay(vst_obj))Static
We will use pheatmap() from the pheatmap package to plot
the static heatmap. We will visualise the top 30 genes based on the
adjusted p-value.
# sort shrunken_res by adjusted p-value ascending
shrunken_res <- shrunken_res[order(shrunken_res$padj),]
# select the top 15 up-regulated
top_15_up <- subset(shrunken_res, shrunken_res$padj <= 0.05 & shrunken_res$log2FoldChange > 0)
top_15_up <- top_15_up[1:15, ]
# select the top 15 down-regulated
top_15_down <- subset(shrunken_res, shrunken_res$padj <= 0.05 & shrunken_res$log2FoldChange < 0)
top_15_down <- top_15_down[1:15, ]
# bind them by row
top_30 <- rbind(top_15_up, top_15_down)
# select top 30 from vst_df
top_30 <- subset(vst_df, rownames(vst_df) %in% rownames(top_30))
p <- pheatmap(top_30,
scale = "row",
cluster_cols=TRUE, cluster_rows = TRUE, treeheight_row = 10,
show_rownames = TRUE,
main = "Expression of Top 30 Genes")# ggsave helps us to save our plots in jpg, pdf, png, and other types of images format
ggsave(filename = "plots/static_heatmap.jpg",
plot = p,
dpi = 600,
units = "in",
height = 8,
width = 5)Let’s change the colour of the heatmap. Here are some of the colour pallets in the RColorBrewer package.
display.brewer.all(colorblindFriendly = TRUE)# We use RdBu
col_pal <- colorRampPalette(rev(brewer.pal(9, "RdBu")))(200)
# see the result
plot(1:length(col_pal), 1:length(col_pal), col = col_pal, cex = 5, pch=19)We selected the new colour for the plot. Let’s add an annotation bar for the samples.
p <- pheatmap(top_30,
col=col_pal,
scale = "row",
cluster_cols=TRUE, cluster_rows = TRUE, treeheight_row = 10,
show_rownames = TRUE,
annotation_col = colData, annotation_names_col = TRUE,
main = "Expression of Top 30 Genes")ggsave(filename = "plots/static_annotated_heatmap.jpg",
plot = p,
dpi = 600,
units = "in",
height = 8,
width = 5)Interactive
To plot the interactive heatmap, we use d3heatmap() from
the d3heatmap package. First, we need to cluster the rows and
columns.
# clustering
clustRows <- hclust(as.dist(1-cor(t(top_30), method="pearson")), method="single")
clustColumns <- hclust(as.dist(1-cor(top_30, method="pearson")), method="single")
plot(as.dendrogram(clustColumns))module.assign <- cutree(clustRows, k = 10)
# plotting
p <- d3heatmap(top_30,
colors = col_pal,
Rowv=as.dendrogram(clustRows),
Colv = as.dendrogram(clustColumns),
scale='row')
# sevewidget() from htmlwidgets helps us to save HTML files
htmlwidgets::saveWidget(p,
file= "plots/interactive_heatmap.html")
pPrincipal Component Analysis (PCA)
PCA is an unsupervised method, similar to clustering, to find
patterns. It simplifies the complexity of high-dimensional data. You can
perform PCA with prcomp().
# we perform PCA based on all of the genes
# Use plotPCA, but return the data for custom plotting
p <- plotPCA(vst_obj,
ntop = nrow(vst_df),
intgroup = "sample_type")
ggsave(filename = "plots/pca_not_labeled.jpg",
plot = p,
dpi = 600)## Saving 7 x 5 in image
Let’s add labels on the points and title.
p <- plotPCA(vst_obj,
ntop = nrow(vst_df),
intgroup = "sample_type") +
geom_text(aes(label = colnames(vst_obj)),
hjust = "right",vjust = "right", angle = 45) +
labs(title = "PCA") +
xlim(-9, 9) + ylim(-9, 9) # it is important to see pca plots in a square panel
ggsave(filename = "plots/pca_labeled.jpg",
plot = p,
dpi = 600)## Saving 7 x 5 in image
Acknowledgement
I would like to thank my supervisor Dr Gregor Reid, Dr Ali Farrokhi, and Tanmaya Atre, for their support and guidance. Also, I would like to thank Reid’s lab members and all of the people who helped me to get through my project.